Classification of document page images based on visual similarity of layout structures
Abstract
Searching for documents by their type or genre is a natural way to enhance the effectiveness of document retrieval. The layout of a document contains a significant amount of information that can be used to classify a document's type in the absence of domain-specific models. A document type or genre can be defined by the user based primarily on layout structure. Our classification approach is based on "visual similarity" of the layout structure: we build a supervised classifier, given examples of the class. We use image features, such as the percentages of text and non-text (graphics, image, table, and ruling) content regions, column structures, variations in the point sizes of fonts, the density of content area, and various statistics on features of connected components, which can be derived from class samples without class knowledge. In order to obtain class labels for training samples, we conducted a user relevance test in which subjects ranked UW-I document images with respect to 12 representative images. We implemented our classification scheme using OC1, a decision tree classifier, and report our findings.
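As a rough illustration of the kind of layout features the abstract lists, the sketch below computes the text percentage, non-text percentage, and content density of a page from a list of segmented regions, then fits a single threshold split on one feature. Everything here is an assumption made for illustration: the `Region` structure, the function names, and the toy pages are hypothetical, and the one-feature split is only a simplified stand-in for a decision tree node, not OC1 itself or the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """One segmented content region (hypothetical structure, not from the paper)."""
    kind: str    # "text", "graphics", "image", "table", or "ruling"
    width: int
    height: int

    @property
    def area(self) -> int:
        return self.width * self.height


def layout_features(regions, page_width, page_height):
    """Return a small layout feature vector for one page image."""
    content_area = sum(r.area for r in regions)
    text_area = sum(r.area for r in regions if r.kind == "text")
    if content_area == 0:
        return {"text_pct": 0.0, "nontext_pct": 0.0, "density": 0.0}
    return {
        "text_pct": text_area / content_area,
        "nontext_pct": (content_area - text_area) / content_area,
        "density": content_area / (page_width * page_height),
    }


def train_stump(samples, labels, feature):
    """Find the threshold on one feature with the fewest training errors.

    labels are booleans: True if the page belongs to the user-defined class.
    Returns (threshold, above_is_positive).
    """
    values = sorted({s[feature] for s in samples})
    best = None  # (errors, threshold, above_is_positive)
    for threshold in ((a + b) / 2 for a, b in zip(values, values[1:])):
        for above_is_positive in (True, False):
            errors = sum(
                ((s[feature] > threshold) == above_is_positive) != label
                for s, label in zip(samples, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, above_is_positive)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1], best[2]


def predict(stump, features, feature):
    """Apply a trained stump to one page's feature vector."""
    threshold, above_is_positive = stump
    return (features[feature] > threshold) == above_is_positive


# Toy 1000 x 1000 pages (made-up data, for illustration only).
text_page = [Region("text", 800, 700)]
figure_page = [Region("text", 800, 200), Region("image", 800, 600)]
mixed_page = [Region("text", 800, 600), Region("table", 800, 200)]

samples = [layout_features(p, 1000, 1000)
           for p in (text_page, figure_page, mixed_page)]
labels = [True, False, True]  # True = "text-heavy" class, chosen by the user

stump = train_stump(samples, labels, "text_pct")
print(stump, [predict(stump, s, "text_pct") for s in samples])
```

A full decision tree would combine many such splits over all of the features, and an oblique tree learner such as OC1 can additionally split on linear combinations of features rather than on one feature at a time.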
Similar resources
Page Layout Classification Technique for Biomedical Documents
The structural layout information of scanned document pages is valuable for a wide range of document processing applications such as automatic document searching, document delivery and automated data entry. This paper describes the classification of scanned document pages into different classes of physical layout structures. The page layout classification technique proposed in this paper uses a...
Document page similarity based on layout visual saliency: Application to query by example and document classification
In this paper we propose to define a measure of visual similarity to compare different pages in a corpus. This measure is based on the analysis of the visual layout saliency of the page composition. This similarity is computed using both the document layout and characteristics of the text itself. The text characterization uses statistical features derived from textural primitives. Our purpose i...
Document page similarity based on layout visual saliency: application to query by example and document classificat - Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on
In this paper we propose to define a measure of visual similarity to compare different pages in a corpus. This measure is based on the analysis of the visual layout saliency of the page composition. This similarity is computed using both the document layout and characteristics of the text itself. The text characterization uses statistical features derived from textural primitives. Our purpose i...
Classification of Document Page Images
Searching in a large heterogeneous collection of scanned document images often produces uncertain results, in part because of the size of the collection and the lack of an ability to focus queries appropriately. Searching for documents by their type is a natural way to enhance the effectiveness of document retrieval in the workplace [2], and such a system is proposed in [4]. The goal of our work...
Document Image Layout Comparison and Classification
This paper describes features and methods for document image comparison and classification at the spatial layout level. The methods are useful for visual similarity based document retrieval as well as fast algorithms for initial document type classification without OCR. A novel feature set called interval encoding is introduced to capture elements of spatial layout. This feature set encodes reg...
Publication date: 2000